
    Beyond the One-Step Greedy Approach in Reinforcement Learning

    The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., n-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has, to our knowledge, not been carefully analyzed yet. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study. Comment: ICML 2018
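    The multi-step improvement idea summarized above can be illustrated with a minimal tabular sketch (an illustration under assumed dense (S, A, S) transition and (S, A) reward arrays, with hypothetical function names, not the paper's construction): an h-step greedy policy is obtained by running h Bellman backups with the current value estimate as terminal value and taking the greedy action of the resulting lookahead problem, so h = 1 recovers standard Policy Iteration.

```python
import numpy as np

def h_step_greedy(P, R, gamma, V, h):
    """h-step greedy policy with respect to a value estimate V.

    P: transitions of shape (S, A, S); R: rewards of shape (S, A).
    Runs h Bellman backups with V as the terminal value and returns the
    greedy action of the resulting lookahead problem (h = 1 is the usual
    one-step greedy improvement).
    """
    W = V.copy()
    for _ in range(h):
        Q = R + gamma * (P @ W)      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * W[s']
        pi = Q.argmax(axis=1)        # greedy action at the current lookahead depth
        W = Q.max(axis=1)
    return pi

def evaluate(P, R, gamma, pi):
    """Exact evaluation of a deterministic policy by solving (I - gamma * P_pi) V = R_pi."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]       # (S, S) transition matrix under pi
    R_pi = R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def h_step_policy_iteration(P, R, gamma, h, iters=50):
    """Policy Iteration whose improvement step is h-step greedy instead of 1-step greedy."""
    V = np.zeros(R.shape[0])
    pi = np.zeros(R.shape[0], dtype=int)
    for _ in range(iters):
        pi = h_step_greedy(P, R, gamma, V, h)
        V = evaluate(P, R, gamma, pi)
    return pi, V
```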

    Reinforcement Learning with Trajectory Feedback

    The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as \emph{trajectory feedback}. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm. Comment: AAAI 2021
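    The least-squares idea mentioned in the abstract can be sketched briefly (a hedged illustration under simplifying assumptions, not the paper's full regret-minimizing algorithm; the function name and the ridge regularizer are choices made here): each trajectory's state-action visit counts act as regression features, the trajectory score is the target, and a regularized least-squares fit yields an estimate of the per-pair reward.

```python
import numpy as np

def estimate_rewards_from_trajectories(trajectories, scores, n_states, n_actions, reg=1e-3):
    """Ridge least-squares estimate of per-(state, action) rewards from trajectory-level scores.

    trajectories: list of [(s, a), ...] pairs visited in each episode.
    scores[i]: the (noisy) sum of rewards observed along trajectory i.
    Since the score is the sum of per-pair rewards, it is (approximately) linear
    in the unknown reward vector: score = visit_counts . r + noise.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * n_actions + a] += 1.0   # visit counts are the regression features
    y = np.asarray(scores, dtype=float)
    # Regularized normal equations: (X^T X + reg * I) r = X^T y
    r_hat = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return r_hat.reshape(n_states, n_actions)
```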

    Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

    Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work \cite{efroni2018beyond}, multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator. Comment: NIPS 2018
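    The soft-policy update in question is a mixture update of the form pi <- (1 - alpha) * pi + alpha * pi_greedy. A short sketch of such an update with exact evaluation of the mixed policy follows (the tabular setup and function names are illustrative assumptions; comparing the evaluated values for small versus large alpha is how one would probe the monotonicity issue described above).

```python
import numpy as np

def soft_policy_update(pi, pi_greedy, alpha):
    """Soft (mixture) update toward a deterministic, possibly multi-step, greedy policy.

    pi: stochastic policy of shape (S, A), rows summing to 1.
    pi_greedy: greedy action indices of shape (S,).
    alpha: stepsize in (0, 1]; per the result above, with a multi-step greedy
    pi_greedy a too-small alpha can break monotonic improvement.
    """
    A = pi.shape[1]
    greedy_onehot = np.eye(A)[pi_greedy]        # one-hot rows for the greedy actions
    return (1.0 - alpha) * pi + alpha * greedy_onehot

def evaluate_stochastic(P, R, gamma, pi):
    """Exact evaluation of a stochastic policy pi on a tabular MDP (P: (S, A, S), R: (S, A))."""
    S = R.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)       # state-to-state transitions under pi
    R_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```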

    Tractable Optimality in Episodic Latent MABs

    We consider a multi-armed bandit problem with M latent contexts, where an agent interacts with the environment for an episode of H time steps. Depending on the length of the episode, the learner may not be able to estimate accurately the latent context. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural assumptions, existing techniques to tackle partially observed settings imply the decision maker can learn a near-optimal policy with O(A)^H episodes, but do not promise more. In this work, we show that learning with {\em polynomial} samples in A is possible. We achieve this by using techniques from experiment design. Then, through a method-of-moments approach, we design a procedure that provably learns a near-optimal policy with O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)}) interactions. In practice, we show that we can formulate the moment-matching via maximum likelihood estimation. In our experiments, this significantly outperforms the worst-case guarantees, as well as existing practical methods. Comment: NeurIPS 2022
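    As a rough illustration of the setting and of the statistics a method-of-moments procedure would consume (a sketch under assumptions made here, with Bernoulli rewards, uniform exploration in place of the paper's experiment-design step, and hypothetical names): the latent context is fixed within an episode, so products of rewards for pairs of arms pulled in the same episode carry second-order moment information about the context mixture.

```python
import numpy as np

def simulate_latent_mab(mu, w, H, n_episodes, rng=None):
    """Simulate an episodic latent MAB with uniform-exploration episodes.

    mu: (M, A) mean rewards per latent context and arm; w: (M,) mixture weights.
    Each episode draws one hidden context, then plays H uniformly random arms.
    Returns the pulled arms and observed Bernoulli rewards per episode.
    """
    rng = np.random.default_rng(rng)
    M, A = mu.shape
    arms, rewards = [], []
    for _ in range(n_episodes):
        m = rng.choice(M, p=w)                 # latent context, fixed within the episode
        a = rng.integers(A, size=H)
        r = rng.binomial(1, mu[m, a])          # Bernoulli rewards with context-dependent means
        arms.append(a)
        rewards.append(r)
    return arms, rewards

def pairwise_moments(arms, rewards, A):
    """Empirical second-order moments E[r_i * r_j] for arm pairs pulled in the same episode.

    These cross-episode-averaged statistics mix the latent contexts and are the
    raw input a moment-matching (or MLE-based) fitting step would work from.
    """
    num = np.zeros((A, A))
    cnt = np.zeros((A, A))
    for a, r in zip(arms, rewards):
        for i in range(len(a)):
            for j in range(len(a)):
                if i != j:
                    num[a[i], a[j]] += r[i] * r[j]
                    cnt[a[i], a[j]] += 1.0
    return num / np.maximum(cnt, 1.0)          # avoid division by zero for unseen pairs
```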